Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition #17332
base: master
Conversation
Add testcase or it didn't happen. :)

Look, children, that's what an evil maintainer looks like :P Will never let you off the hook with any PR, ever!
update_cuda_graph_executable: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
update_cuda_graph_executable: CUDA graph update failed
CONT(type=f32,ne=[10,10,10,1]): OK
CONT(type=f32,ne=[2,1,1,1]): OK
CONT(type=f32,ne=[2,1,3,5]): OK
CONT(type=f32,ne=[2,3,5,7]): OK
CONT(type=f16,ne=[2,1,1,1]): OK
CONT(type=f16,ne=[2,1,3,5]): OK
CONT(type=f16,ne=[2,3,5,7]): OK
CONT(type=bf16,ne=[2,1,1,1]): OK
CONT(type=bf16,ne=[2,1,3,5]): OK
CONT(type=bf16,ne=[2,3,5,7]): OK
CONT(type=f32,ne=[1,4,2,1]): [CONT] NMSE = 0.447623183 > 0.000000100 FAIL
CONT(type=f32,ne=[1,8,17,1]): [CONT] NMSE = 2.241813873 > 0.000000100 FAIL
CONT(type=bf16,ne=[1,4,2,1]): [CONT] NMSE = 0.058848433 > 0.000000100 FAIL
CONT(type=bf16,ne=[1,8,17,1]): [CONT] NMSE = 1.181509486 > 0.000000100 FAIL
10/14 tests passed

vs

update_cuda_graph_executable: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
update_cuda_graph_executable: CUDA graph update failed
CONT(type=f32,ne=[10,10,10,1]): OK
CONT(type=f32,ne=[2,1,1,1]): OK
CONT(type=f32,ne=[2,1,3,5]): OK
CONT(type=f32,ne=[2,3,5,7]): OK
CONT(type=f16,ne=[2,1,1,1]): OK
CONT(type=f16,ne=[2,1,3,5]): OK
CONT(type=f16,ne=[2,3,5,7]): OK
CONT(type=bf16,ne=[2,1,1,1]): OK
CONT(type=bf16,ne=[2,1,3,5]): OK
CONT(type=bf16,ne=[2,3,5,7]): OK
CONT(type=f32,ne=[1,4,2,1]): OK
CONT(type=f32,ne=[1,8,17,1]): OK
CONT(type=bf16,ne=[1,4,2,1]): OK
CONT(type=bf16,ne=[1,8,17,1]): OK
14/14 tests passed
Backend CUDA0: OK

@CISC There :)
JohannesGaessler left a comment
Thank you for the fix and sorry for not catching the bug during review; some functionality that would have covered this was removed and the logic was not adjusted.
Preferably, add an argument to the existing tests for GGML_OP_CONT rather than adding a new test case.
@JohannesGaessler okay, integrated it into the CONT test, added a flag
JohannesGaessler left a comment
Do you have the necessary permissions to hit the merge button or do I need to do it?
Nope, don't have write access. But I'll address the Evil Maintainer's concerns first :D

Alright, can merge. But you know, @CISC, now that you've given me this idea I'm going to run around and refactor all the tests to use loops? :P
@pwilkin please do a more general refactor of tests in a dedicated PR.

Of course, it was meant as a generalized …
(and BTW, don't worry, I'm not opening a new refactoring till I'm done with the …)

Great, I'll keep you busy with changes there then. 👿

The very first failed CI job I checked failed because the WebGPU backend crashes on the newly added tests. Please remove the tests from this PR; then we can merge only the fix for CUDA and make a new PR with the tests, which we can merge once all backends work correctly.

It doesn't actually fail on the test, BTW; it fails because with F16 the view offset of …
@CISC @JohannesGaessler So wait, should I remove the F16 testcase, or is it actually something that should be fixed on the WebGPU backend?

You can change it to a tensor where the view offset is a byte-multiple of 4 for now.
Okay, I've taken out the F16/BF16 tests with … BTW, I think the problem might be more subtle, because the tensor {1, 4, 2, 1} is a multiple of 4. What happens is that the calculated view slice offset is possibly not a multiple of 4.

Yeah, I edited my misleading comment. :)
@JohannesGaessler WebGPU succeeded this time so I think it's safe to merge.

Yep, looks like this is a bug in the WebGPU backend. WebGPU has more stringent alignment requirements for portability. This code considers the cases where the offset + size is not a multiple of 4, but not when the starting offset is not a multiple of 4! Should be a relatively easy fix. I might prefer to fix it in the same PR that adds back the test cases that currently fail, so I am happy to work with whoever creates that PR on the fix.
@pwilkin my opinion is that we should merge the fix for CUDA earlier rather than later. That is the important part; the main purpose of the tests is to avoid having the same breakage again later on. But if the addition of a test case breaks the tests themselves, we should put those into a different PR in the meantime. And we should definitely keep the problematic case for WebGPU rather than work around it.
@JohannesGaessler yeah, I kept the regression tests, I just added a skip (and TODO) for the WebGPU cases.
@reeselevine Once this gets merged you can just remove the continue block with the TODO comment for the failing tests.
See #16095 (comment) for the case.